Introduction
The Seaside Hotel has asked for a machine learning model that can predict booking cancellations based on the available data. This report aims to identify the best model by comparing different models and their performance. A final model will be recommended, which the hotel can implement and deploy to optimize the planning of staff, inventory and marketing efforts.
Data understanding
The Seaside Hotel has provided a train data set on which several models can be trained and evaluated. The target variable that should be predicted is is_cancelled. This target variable contains two classes: yes and no. The data shows that there is some imbalance in these classes, around 42% and 58% respectively. Therefore, the AUC metric is used instead of Accuracy to select models during model training. The class imbalance is also important to take into consideration when making the class predictions.
The entire data set is split into two smaller sets, namely a train set of 60% and a test set of 40%. Stratified splitting was used for this, meaning that the proportions of the classes of the target variable in the sets are similar to that of the entire data set. There are 23799 observations in the train set and 15864 observations in the test set. For now, the test set is set aside and will only be used after the models have been trained.
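The idea behind stratified splitting can be sketched in base R. The report itself relies on caret's `createDataPartition`; the toy data frame `toy` and its 580/420 class counts below are made up for illustration and only mirror the reported class shares.

```r
# Toy data (hypothetical): 58% "no" and 42% "yes", mirroring the reported class shares
set.seed(1603)
toy <- data.frame(
  id = 1:1000,
  is_cancelled = factor(rep(c("no", "yes"), times = c(580, 420)))
)

# Draw 60% of the row indices within each class separately
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$is_cancelled),
  function(idx) idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The class proportions are preserved in both parts
prop.table(table(toy_train$is_cancelled))
prop.table(table(toy_test$is_cancelled))
```

Because the sampling happens per class, both parts keep the 58/42 ratio, which a plain random split would only do approximately.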
When exploring the data, some insights could be derived. The data contains no metadata or variables that could potentially introduce data leakage. In addition, no major multicollinearity issues were found when checking whether any predictors are highly correlated with one another (Appendix 6.7). Thus, no features are removed beforehand.
After looking at the distributions of the classes of the target variable for the numerical variables, one can already get an idea of what could be potentially promising features (Appendix 6.7). For example, the lead time and number of required parking spaces could possibly be good predictors for whether a person will cancel a booking or not, due to the good separation of classes. Despite these initial insights, all features are considered when building the predictive models.
When looking at the categorical variables in the data (Appendix 6.8), mainly country has many categories (137 in total). Thus, multiple categories will be consolidated to significantly reduce the number of infrequent categories. Moreover, meal, market_segment, deposit_type and customer_type each have a category with a relatively low frequency (a share below 0.01). These categories will also be combined with another category, so as to prevent potential problems when resampling.
Model selection
To determine what the best predictive model is, different models are trained using a train set, after which their performance is evaluated using the test set.
As mentioned earlier, preprocessing first merges some categories and then dummy encodes all categorical variables. Yeo-Johnson transformations were applied to the numerical variables, since these variables also contain zeros, which a Box-Cox transformation cannot handle. The transformation reduces the skewness of skewed features. The recipe containing these preprocessing steps was prepared on the train set and then baked on the train set as well as on the test set. No normalization was needed for the machine learning models that were trained in this report. The recipe can be found in Appendix 6.1.
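To make the Yeo-Johnson step more concrete, the transformation can be written out in base R. This is a minimal sketch with a fixed `lambda`; the recipe step estimates a separate lambda per variable, so the helper `yeo_johnson` and the values used below are illustrative assumptions only.

```r
# Yeo-Johnson transformation for a fixed lambda (illustrative helper)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values: mirrored Box-Cox with parameter 2 - lambda
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Right-skewed toy values that include a zero (hypothetical lead times)
lead_time_toy <- c(0, 11, 49, 117, 518)
yeo_johnson(lead_time_toy, lambda = 0)   # equals log(lead_time_toy + 1)
```

Zeros are mapped to zero for any lambda, which is why the transformation can be applied to count-like variables without first shifting them.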
Model training
This section will dive deeper into the trained models. The models were trained using the smaller train set. See Appendix 6.2-6.5 for the code of the models.
Cross-validation was performed besides using a separate test set for model assessment. Where applicable, the one standard error rule is applied to select the value of the tuning parameter. To provide an overview, the following settings were used to train the models:

|                       |                         |
|-----------------------|-------------------------|
| **Resampling method** | Cross-validation        |
| **Number of folds**   | 10                      |
| **Selection method**  | One standard error rule |
| **Metric**            | AUC                     |

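The one standard error rule can be illustrated with a few lines of base R. The cross-validation results below are invented numbers, and `one_se_choice` is a hypothetical helper written for this sketch; in the report the selection is delegated to caret through `selectionFunction = "oneSE"`.

```r
# Hypothetical cross-validation results, ordered from the simplest to the most complex model
cv_results <- data.frame(
  complexity = 1:5,
  mean_auc   = c(0.850, 0.880, 0.889, 0.891, 0.890),
  se_auc     = c(0.004, 0.004, 0.003, 0.003, 0.003)
)

one_se_choice <- function(res) {
  best   <- which.max(res$mean_auc)
  cutoff <- res$mean_auc[best] - res$se_auc[best]
  # Pick the first (simplest) candidate whose mean AUC reaches the cutoff
  res[which(res$mean_auc >= cutoff)[1], ]
}

chosen <- one_se_choice(cv_results)
chosen
```

In this toy example the best mean AUC belongs to the fourth candidate, but the third one lies within one standard error of it and is therefore preferred as the simpler model.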
Backward stepwise regression
A logistic regression (glm) with backward stepwise feature selection was trained using 10-fold cross-validation. The AIC, which penalizes model complexity, served as the criterion for dropping predictors. The final model resulting from this training had a cross-validated AUC (reported by caret as ROC) of 0.891. The 10 most important variables according to this model are shown below.

| Variable                  | Overall  |
|---------------------------|----------|
| total_of_special_requests | 33.64804 |
| country_PRT               | 29.29297 |
| lead_time                 | 27.58251 |
| previous_cancellations    | 15.74839 |
| deposit_type_Non.Refund   | 13.41066 |
| adr                       | 12.45370 |
| market_segment_Online.TA  | 12.28345 |
| country_ITA               | 12.09257 |
| country_other             | 11.69210 |
| country_CHN               | 11.55897 |

This shows that the most important predictor of whether a booking will be canceled is the number of special requests. The lead time is also a good predictor, as was already suggested during data exploration. Moreover, the country indicators for Portugal, Italy and China also seem to be good predictors of is_cancelled.
Lasso regression
Next, a lasso regression was trained, also using 10-fold cross validation. This model is also able to identify (ir)relevant variables by shrinking those that don’t have a lot of influence on is_cancelled towards 0. The optimal model was selected using the one SE rule and had a lambda value of 0.0023357 and ROC of 0.889. The selected value of lambda can be seen in the following graph as well:

The estimated coefficients of the final model showed that 11 of the 53 variables were set to 0. Months such as March, October and November were removed, possibly because these months fall in the low season of the hotel industry. When looking at the importance of variables, a few conclusions can be drawn. In this model, the deposit type Non-refund is the most important predictor. This predictor was also in the top 5 of most important variables in the backward stepwise regression. Furthermore, the total number of special requests, Portugal, China, previous cancellations and the online travel agent market segment are considered important in both models when comparing them in terms of variable importance. As such, there is a good number of predictors that both models agree are relevant for predicting hotel booking cancellations.

| Variable                      | Overall   |
|-------------------------------|-----------|
| deposit_type_Non.Refund       | 4.2873172 |
| required_car_parking_spaces   | 3.5369380 |
| total_of_special_requests     | 3.0081430 |
| country_PRT                   | 1.6439634 |
| meal_other                    | 1.5036620 |
| previous_cancellations        | 1.3156475 |
| country_CHN                   | 1.1433875 |
| market_segment_Online.TA      | 1.0976499 |
| country_BRA                   | 0.7003225 |
| customer_type_Transient.Party | 0.6474984 |

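The importance values of the lasso model match, up to display precision, the absolute values of the coefficients printed in the appendix. The sketch below reproduces that ranking from a hand-copied subset of those coefficients. The sign is kept alongside, since the importance alone hides the direction of the effect. Reading a positive sign as a higher cancellation probability is an assumption here, although it is in line with Non.Refund bookings being almost always cancelled in the data summary.

```r
# Subset of the lasso coefficients, copied from the appendix output
lasso_coefs <- c(
  deposit_type_Non.Refund     =  4.28731731,
  required_car_parking_spaces = -3.53693821,
  total_of_special_requests   = -3.00814300,
  country_PRT                 =  1.64396345,
  meal_other                  =  1.50366207,
  previous_cancellations      =  1.31564758,
  lead_time                   =  0.16096513,
  arrival_date_month_March    =  0          # shown as "." (shrunk to zero)
)

# Importance as the absolute coefficient, sorted from large to small
importance <- sort(abs(lasso_coefs), decreasing = TRUE)
direction  <- sign(lasso_coefs[names(importance)])

data.frame(importance = importance, sign = direction)
```

Parking spaces and special requests rank second and third in importance, yet both carry a negative coefficient, in contrast to the deposit type.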
Decision tree
A decision tree was built and trained using 10-fold cross-validation. The selected model had a cp value of 0.003726 and a ROC of 0.875. Pruning was performed using the one standard error rule. The decision tree looks as follows:

The variable importance in this model can be seen in the table below. Again, consistent with the previous models, deposit_type_Non.Refund, total_of_special_requests, country_PRT, lead_time, previous_cancellations, market_segment_Online.TA and customer_type_Transient.Party seem to be important predictors.

| Variable                      | Overall   |
|-------------------------------|-----------|
| deposit_type_Non.Refund       | 3160.1062 |
| total_of_special_requests     | 2325.2752 |
| country_PRT                   | 2213.5471 |
| lead_time                     | 2087.6614 |
| previous_cancellations        | 1287.8414 |
| market_segment_Online.TA      | 595.2344  |
| arrival_date_year             | 526.5245  |
| market_segment_Other          | 381.7769  |
| required_car_parking_spaces   | 266.7346  |
| customer_type_Transient.Party | 252.5998  |

Boosted classification tree
Lastly, a boosted classification tree was trained using 10-fold cross validation. The tuning grid to train the model was as follows:

|                         |                  |
|-------------------------|------------------|
| **Boosting iterations** | 1000, 2500, 3000 |
| **Interaction depth**   | 1, 2, 3          |
| **Shrinkage**           | 0.01, 0.1        |

A summary of the training results can be seen below:

The final model selected with the one SE rule has 1000 boosting iterations, an interaction depth of 3 and a shrinkage parameter of 0.01. The corresponding ROC is 0.9118397. One tree of this model is shown below. From the plot, we can see that the deposit type Non-refund is used for the first split in the tree.
Model performance
In this section, the models are assessed on how well they predict the smaller test set that was set aside earlier. After making the predictions, confusion matrices were created for each model. The decision threshold was set to 0.42 instead of the default of 0.5, to account for the class imbalance. Using the default results in a higher Accuracy, but biases the models towards the majority class, which can lead to erroneous conclusions about the predictions. The performance metrics for each model are given below.
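The effect of the decision threshold can be shown on a handful of invented predictions. The probabilities, labels and the helpers `classify` and `metrics` below are hypothetical, and treating a probability equal to the threshold as a cancellation is an assumption of this sketch.

```r
# Hypothetical predicted cancellation probabilities and true labels
prob_yes <- c(0.10, 0.30, 0.45, 0.48, 0.55, 0.80, 0.20, 0.90)
truth    <- factor(c("no", "no", "no", "yes", "yes", "yes", "no", "yes"),
                   levels = c("yes", "no"))

# Turn probabilities into class predictions for a given threshold
classify <- function(prob, threshold) {
  factor(ifelse(prob >= threshold, "yes", "no"), levels = c("yes", "no"))
}

# Accuracy, sensitivity and specificity from the confusion matrix
metrics <- function(pred, truth) {
  tab <- table(pred, truth)
  tp <- tab["yes", "yes"]; fp <- tab["yes", "no"]
  fn <- tab["no", "yes"];  tn <- tab["no", "no"]
  c(accuracy    = (tp + tn) / sum(tab),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

m_default <- metrics(classify(prob_yes, 0.50), truth)
m_lowered <- metrics(classify(prob_yes, 0.42), truth)
rbind(m_default, m_lowered)
```

Lowering the threshold flags more bookings as cancellations: sensitivity rises at the expense of specificity, while Accuracy happens to stay the same in this toy example.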
Overview of model performance

| Metric        | Backward stepwise | Lasso  | Decision tree | Boosted tree |
|---------------|-------------------|--------|---------------|--------------|
| Accuracy      | 0.8137            | 0.8105 | 0.8272        | 0.8523       |
| Cohen’s kappa | 0.6076            | 0.6081 | 0.6385        | 0.6971       |
| Sensitivity   | 0.8315            | 0.7842 | 0.8333        | 0.8168       |
| Specificity   | 0.8041            | 0.7531 | 0.8237        | 0.8786       |
| Precision     | 0.6941            | 0.7531 | 0.7324        | 0.8328       |
| Recall        | 0.8315            | 0.7842 | 0.8333        | 0.8168       |
| F1            | 0.7566            | 0.7683 | 0.7796        | 0.8247       |

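The F1 scores can be re-derived from the reported precision and recall as a consistency check. The sketch assumes that the four value columns correspond, in order, to the backward stepwise regression, the lasso, the decision tree and the boosted tree, which is inferred from the Accuracy ranking discussed in the text; the object names are illustrative.

```r
# Reported precision, recall and F1 per model (column order is an assumption)
perf <- data.frame(
  model     = c("step_glm", "lasso", "d_tree", "xgb"),
  precision = c(0.6941, 0.7531, 0.7324, 0.8328),
  recall    = c(0.8315, 0.7842, 0.8333, 0.8168),
  f1_report = c(0.7566, 0.7683, 0.7796, 0.8247)
)

# F1 is the harmonic mean of precision and recall
perf$f1_calc <- with(perf, 2 * precision * recall / (precision + recall))
perf$f1_calc <- round(perf$f1_calc, 4)
perf
```

The recomputed values agree with the reported F1 scores to four decimals, so the precision, recall and F1 rows are mutually consistent.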
Based on the Accuracy, the boosted classification tree seems to perform the best, followed by the decision tree. The backward stepwise regression comes close behind, whilst the lasso regression has the lowest Accuracy. The other metrics also give a good indication of how well the models perform. The performance of the models will be visualized to get a better idea which model really performs the best.
ROC and AUC
When plotting the ROC curves for the models, it becomes clearer that the gradient boosting model indeed performs best, as its curve lies above all others. The decision tree has a range where it performs worse than all other models.

Below are also the AUCs of the models. The boosted classification tree has the highest score, which is very good. The AUCs of the lasso regression and the backward stepwise regression are very close to one another, whereas the decision tree has the lowest AUC of all. Thus, based on the AUC, the decision tree does not perform as well as the other models, contrary to what the Accuracy indicated earlier.

| Model    | AUC       |
|----------|-----------|
| d_tree   | 0.8693387 |
| lasso    | 0.8909033 |
| step_glm | 0.8919130 |
| xgb      | 0.9325870 |

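For intuition, the AUC equals the probability that a randomly chosen cancelled booking receives a higher score than a randomly chosen non-cancelled one. A minimal base R sketch of this rank-based computation follows; `auc_rank` is a hypothetical helper and the scores are invented, so this is not how the values in the table were produced.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(score, is_positive) {
  r     <- rank(score)          # average ranks take care of ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 of the 4 positive-negative pairs are ordered correctly
toy_auc <- auc_rank(c(0.1, 0.4, 0.35, 0.8), c(FALSE, FALSE, TRUE, TRUE))
toy_auc
```

Because only the ordering of the scores matters, the AUC does not depend on the decision threshold, which is what makes it suitable for model selection here.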
Class separation
In the following figures, one can see the class distributions for each model. From these graphs, the gradient boosting model, lasso regression and backward stepwise regression seem to do a good job of separating the classes. The decision tree shows more overlap between the classes and thus may not be able to separate the classes as well as the others. This is consistent with our former observations.

Cumulative gain
The figure below shows the cumulative gain charts for the models. The upper dashed grey line is the best possible scenario, and would mean that the classifier is able to rank the test cases perfectly. The dashed grey line underneath is the worst-case scenario, and would mean that the classifier ranks randomly. When comparing the lines of the models, the boosted classification tree performs best once again. The lines of the backward stepwise regression and lasso regression are almost identical.

In this section, the results showed that the boosted classification tree clearly outperforms the other models. Therefore, this model was selected for evaluation in the final part of the report.
Final model
The owner of the Seaside Hotel has provided a final test set. This section reports the performance of the selected model on this test set. The final model was first trained on the entire train data set before making the predictions. The ROC on the entire train data set is now 0.913.
The table below provides an overview of the performance metrics of the gradient boosting model, which made predictions on the final test set.

| Metric        | Value  |
|---------------|--------|
| Accuracy      | 0.8504 |
| Cohen’s kappa | 0.6931 |
| Sensitivity   | 0.8150 |
| Specificity   | 0.8766 |
| Precision     | 0.8298 |
| Recall        | 0.8150 |
| F1            | 0.8223 |

The final Accuracy is 0.8504, which indicates that the model predicted quite well on the test set. Unlike Accuracy, the F1 score balances precision and recall, so a high value indicates few false positives and few false negatives. It is therefore also a measure of how well the model identifies actual cancellations, and with a value of 0.822 it is considered good here. The sensitivity-specificity and precision-recall pairs also reflect a good predictive ability of the model. Taking the AUC into consideration as well, the gradient boosting model again shows an excellent score:

| Model     | AUC       |
|-----------|-----------|
| final_xgb | 0.9314404 |

As seen in the figure below, there is a clear distinction between the classes. It can therefore be concluded that this model is able to separate the classes well.

Below is the cumulative gain chart for the gradient boosting model on the final test set. This graph shows that the model can identify about 60% of the cancellations with just 25% of the test set.
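A single point on a cumulative gain chart, such as the share of cancellations captured in the top 25% of ranked bookings, can be computed as sketched below. The helper `cumulative_gain` and the toy scores are hypothetical and unrelated to the actual test set.

```r
# Share of all positives found among the top `fraction` of cases, ranked by score
cumulative_gain <- function(score, is_positive, fraction) {
  ord   <- order(score, decreasing = TRUE)
  n_top <- ceiling(fraction * length(score))
  sum(is_positive[ord][seq_len(n_top)]) / sum(is_positive)
}

# Toy example with 8 bookings, 4 of which were cancelled
score        <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20)
is_cancelled <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

gain_25 <- cumulative_gain(score, is_cancelled, 0.25)
gain_25
```

In the toy data the two highest-scored bookings contain two of the four cancellations, so the gain at 25% is 0.5; a random ranking would be expected to capture about 25%.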

Conclusion
This report looked at subset selection-, regularization-, decision tree- and ensemble methods to build a predictive model for the Seaside Hotel, as a means to forecast booking cancellations.
After training several different models on the train set and comparing their performance, it can be concluded that the gradient boosting model performed best of all. Therefore, this model was selected to predict the test set. For this model, Yeo-Johnson transformations were performed on the predictors, and the total number of categories was reduced to a great extent. 10-fold cross-validation was used on the train set to find the optimal tuning parameters, and the one standard error rule was applied to find a simpler model which is not significantly worse than the absolute best. Eventually, the final gradient boosting model had 1000 iterations, an interaction depth of 3 and a shrinkage value of 0.01. The results indicated a good predictive performance of the model.
Ensembles such as boosting models are usually considered highly accurate compared to, for example, single decision trees, which are prone to overfitting. This can explain the good performance of the gradient boosting model. The cost of this model is its limited interpretability; in that respect, the other three models assessed in this report are better. Good interpretability can provide a better understanding of the features that may or may not be good predictors of is_cancelled. Because of this, these three models were able to give more insight into the potential drivers of hotel cancellations. One of them is the lead time between the booking date and the arrival date. Bookings that are guaranteed with a full deposit (Non-refund) can give an early indication of whether a booking will be canceled or not. Furthermore, bookings made through an online travel agent also seemed to be a good predictor of booking cancellations. The same goes for previous cancellations and the number of special requests. These insights can help the hotel in its marketing efforts, for example by providing offers to guests who are likely to cancel.
When it comes to staff and inventory planning however, the boosted classification tree will provide more useful and accurate predictions. Although the final gradient boosting model performs well, there are some points of improvement for future iterations. It is recommended to use more computing power to try more and different models, to find a better model. The selected values for the tuning parameters were on the edge of the grid. For example, an interaction depth of more than 3 could potentially lead to a better performance. As such, it is recommended to iterate on more intensive models.
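As an illustration of this recommendation, a wider tuning grid could extend the ranges in the directions where the selected values hit the edge: fewer boosting iterations than 1000 and trees deeper than 3. The values below are illustrative assumptions, not tested settings; the column names follow the tuning grid shown in the appendix.

```r
# Hypothetical wider grid for a future iteration (all values are illustrative)
tuneGrid_xgb_wide <- expand.grid(
  eta              = c(0.01, 0.05, 0.1),
  max_depth        = 3:6,                  # extends beyond the previous maximum of 3
  colsample_bytree = 0.6,
  subsample        = 1,
  nrounds          = c(500, 1000, 2500),   # also probes below the previous minimum of 1000
  gamma            = 0,
  min_child_weight = 1
)

nrow(tuneGrid_xgb_wide)   # number of parameter combinations to evaluate
```

With 10-fold cross-validation each combination is fitted ten times, so the size of the grid directly drives the computing power that is needed.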
Appendix
Preprocessing recipe
bookings_recipe <- recipe(is_cancelled ~ ., data = bookings_train_small) %>%
  step_other(country, threshold = 0.01) %>%
  step_other(meal, market_segment, deposit_type, customer_type, threshold = 0.01) %>%
  step_dummy(all_nominal(), -is_cancelled) %>%
  step_YeoJohnson(all_numeric())

bookings_recipe_prep <- bookings_recipe %>% prep(data = bookings_train_small)

bookings_train_baked <- bookings_recipe_prep %>% bake(new_data = bookings_train_small)
bookings_test_baked <- bookings_recipe_prep %>% bake(new_data = bookings_test_small)

bookings_final_train_baked <- bookings_recipe_prep %>% bake(new_data = bookings_train)
bookings_final_test_baked <- bookings_recipe_prep %>% bake(new_data = bookings_test_solutions)
Model 1: Backward stepwise selection
set.seed(1603)
step_glm <- train(is_cancelled ~ .,
                  data = bookings_train_baked,
                  method = "glmStepAIC",
                  family = binomial(),
                  trControl = ctrl,
                  metric = "ROC")
step_glm
Model 2: Lasso regression
tuneGrid_lasso <- expand.grid(alpha = 1,
                              lambda = 10^(seq(from = -4, to = -2, length.out = 20)))
set.seed(1603)
lasso <- train(is_cancelled ~ .,
               data = bookings_train_baked,
               method = "glmnet",
               family = "binomial",
               trControl = ctrl,
               tuneGrid = tuneGrid_lasso,
               metric = "ROC")
lasso
coef(lasso$finalModel, 0.002335721)
54 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -155.07403234
lead_time 0.16096513
arrival_date_year 0.07501466
arrival_date_week_number .
arrival_date_day_of_month .
stays_in_weekend_nights 0.24790511
stays_in_week_nights 0.17868090
adults .
children .
babies 0.06065662
is_repeated_guest -0.03038507
previous_cancellations 1.31564758
previous_bookings_not_cancelled -0.18079370
booking_changes -0.33299026
days_in_waiting_list .
adr 0.04325121
required_car_parking_spaces -3.53693821
total_of_special_requests -3.00814300
arrival_date_month_August 0.03084553
arrival_date_month_December 0.07791437
arrival_date_month_February 0.17798813
arrival_date_month_January 0.11906021
arrival_date_month_July -0.16763701
arrival_date_month_June -0.12892701
arrival_date_month_March .
arrival_date_month_May -0.06518944
arrival_date_month_November .
arrival_date_month_October .
arrival_date_month_September -0.23836729
meal_None 0.26235646
meal_HB -0.13464881
meal_other 1.50366207
country_BEL -0.16127128
country_BRA 0.70032249
country_CHE -0.20307611
country_CHN 1.14338759
country_DEU -0.48600894
country_ESP 0.39858615
country_FRA -0.23968226
country_GBR .
country_IRL 0.12605389
country_ITA 0.56295118
country_NLD -0.02612917
country_PRT 1.64396345
country_USA .
country_other 0.34473001
market_segment_Offline.TA.TO -0.33081181
market_segment_Online.TA 1.09764995
market_segment_Other -0.26647306
deposit_type_Non.Refund 4.28731731
deposit_type_other .
customer_type_Transient.Party -0.64749844
customer_type_Contract -0.24161309
customer_type_other -0.29092735
Model 3: Decision tree
set.seed(1603)
d_tree <- train(is_cancelled ~ .,
                data = bookings_train_baked,
                method = "rpart",
                tuneLength = 10,
                trControl = ctrl,
                metric = "ROC")
d_tree
Model 4: Boosted classification tree
tuneGrid_xgb <- expand.grid(eta = c(0.01, 0.1),
                            max_depth = 1:3,
                            colsample_bytree = 0.6,
                            subsample = 1,
                            nrounds = c(1000, 2500, 5000),
                            gamma = 0,
                            min_child_weight = 1)
set.seed(1603)
xgb <- train(is_cancelled ~ .,
             data = bookings_train_baked,
             method = "xgbTree",
             trControl = ctrl,
             tuneGrid = tuneGrid_xgb,
             metric = "ROC")
xgb
Data summary
Data summary

|                            |            |
|----------------------------|------------|
| Name                       | Piped data |
| Number of rows             | 23799      |
| Number of columns          | 24         |
| **Column type frequency:** |            |
| character                  | 2          |
| factor                     | 5          |
| numeric                    | 17         |
| Group variables            | None       |

Variable type: character

| skim_variable      | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|--------------------|-----------|---------------|-----|-----|-------|----------|------------|
| arrival_date_month | 0         | 1             | 3   | 9   | 0     | 12       | 0          |
| country            | 0         | 1             | 2   | 4   | 0     | 137      | 0          |

Variable type: factor

| skim_variable  | n_missing | complete_rate | ordered | n_unique | top_counts                                  |
|----------------|-----------|---------------|---------|----------|---------------------------------------------|
| is_cancelled   | 0         | 1             | FALSE   | 2        | no: 13869, yes: 9930                        |
| meal           | 0         | 1             | FALSE   | 4        | BB: 18714, Non: 3089, HB: 1983, FB: 13      |
| market_segment | 0         | 1             | FALSE   | 4        | Onl: 11526, Off: 5137, Gro: 4217, Oth: 2919 |
| deposit_type   | 0         | 1             | FALSE   | 3        | No : 19886, Non: 3909, Ref: 4               |
| customer_type  | 0         | 1             | FALSE   | 4        | Tra: 17770, Tra: 5242, Con: 701, Gro: 86    |

Variable type: numeric

| skim_variable                   | n_missing | complete_rate | mean    | sd     | p0   | p25    | p50     | p75  | p100 | hist  |
|---------------------------------|-----------|---------------|---------|--------|------|--------|---------|------|------|-------|
| lead_time                       | 0         | 1             | 109.00  | 111.05 | 0    | 22.0   | 73.00   | 162  | 629  | ▇▂▁▁▁ |
| arrival_date_year               | 0         | 1             | 2016.17 | 0.70   | 2015 | 2016.0 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
| arrival_date_week_number        | 0         | 1             | 27.24   | 13.46  | 1    | 17.0   | 27.00   | 38   | 53   | ▅▇▇▇▅ |
| arrival_date_day_of_month       | 0         | 1             | 15.84   | 8.72   | 1    | 8.0    | 16.00   | 23   | 31   | ▇▇▇▇▆ |
| stays_in_weekend_nights         | 0         | 1             | 0.79    | 0.89   | 0    | 0.0    | 1.00    | 2    | 16   | ▇▁▁▁▁ |
| stays_in_week_nights            | 0         | 1             | 2.18    | 1.46   | 0    | 1.0    | 2.00    | 3    | 41   | ▇▁▁▁▁ |
| adults                          | 0         | 1             | 1.85    | 0.51   | 0    | 2.0    | 2.00    | 2    | 4    | ▁▂▇▁▁ |
| children                        | 0         | 1             | 0.09    | 0.36   | 0    | 0.0    | 0.00    | 0    | 3    | ▇▁▁▁▁ |
| babies                          | 0         | 1             | 0.01    | 0.07   | 0    | 0.0    | 0.00    | 0    | 2    | ▇▁▁▁▁ |
| is_repeated_guest               | 0         | 1             | 0.03    | 0.16   | 0    | 0.0    | 0.00    | 0    | 1    | ▇▁▁▁▁ |
| previous_cancellations          | 0         | 1             | 0.08    | 0.41   | 0    | 0.0    | 0.00    | 0    | 13   | ▇▁▁▁▁ |
| previous_bookings_not_cancelled | 0         | 1             | 0.13    | 1.78   | 0    | 0.0    | 0.00    | 0    | 70   | ▇▁▁▁▁ |
| booking_changes                 | 0         | 1             | 0.18    | 0.60   | 0    | 0.0    | 0.00    | 0    | 18   | ▇▁▁▁▁ |
| days_in_waiting_list            | 0         | 1             | 3.35    | 21.27  | 0    | 0.0    | 0.00    | 0    | 391  | ▇▁▁▁▁ |
| adr                             | 0         | 1             | 105.42  | 52.34  | 0    | 79.2   | 99.45   | 126  | 5400 | ▇▁▁▁▁ |
| required_car_parking_spaces     | 0         | 1             | 0.02    | 0.15   | 0    | 0.0    | 0.00    | 0    | 3    | ▇▁▁▁▁ |
| total_of_special_requests       | 0         | 1             | 0.54    | 0.78   | 0    | 0.0    | 0.00    | 1    | 5    | ▇▁▁▁▁ |

Multicollinearity

Class distributions - Numeric variables
Data summary

|                            |              |
|----------------------------|--------------|
| Name                       | Piped data   |
| Number of rows             | 23799        |
| Number of columns          | 24           |
| **Column type frequency:** |              |
| character                  | 2            |
| factor                     | 4            |
| numeric                    | 17           |
| Group variables            | is_cancelled |

Variable type: character

| skim_variable      | is_cancelled | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|--------------------|--------------|-----------|---------------|-----|-----|-------|----------|------------|
| arrival_date_month | no           | 0         | 1             | 3   | 9   | 0     | 12       | 0          |
| arrival_date_month | yes          | 0         | 1             | 3   | 9   | 0     | 12       | 0          |
| country            | no           | 0         | 1             | 2   | 4   | 0     | 118      | 0          |
| country            | yes          | 0         | 1             | 2   | 4   | 0     | 106      | 0          |

Variable type: factor

| skim_variable  | is_cancelled | n_missing | complete_rate | ordered | n_unique | top_counts                                 |
|----------------|--------------|-----------|---------------|---------|----------|--------------------------------------------|
| meal           | no           | 0         | 1             | FALSE   | 4        | BB: 10672, Non: 1964, HB: 1230, FB: 3      |
| meal           | yes          | 0         | 1             | FALSE   | 4        | BB: 8042, Non: 1125, HB: 753, FB: 10       |
| market_segment | no           | 0         | 1             | FALSE   | 4        | Onl: 7247, Off: 2928, Oth: 2406, Gro: 1288 |
| market_segment | yes          | 0         | 1             | FALSE   | 4        | Onl: 4279, Gro: 2929, Off: 2209, Oth: 513  |
| deposit_type   | no           | 0         | 1             | FALSE   | 3        | No : 13861, Non: 6, Ref: 2                 |
| deposit_type   | yes          | 0         | 1             | FALSE   | 3        | No : 6025, Non: 3903, Ref: 2               |
| customer_type  | no           | 0         | 1             | FALSE   | 4        | Tra: 9682, Tra: 3727, Con: 384, Gro: 76    |
| customer_type  | yes          | 0         | 1             | FALSE   | 4        | Tra: 8088, Tra: 1515, Con: 317, Gro: 10    |

Variable type: numeric

| skim_variable                   | is_cancelled | n_missing | complete_rate | mean    | sd     | p0   | p25    | p50  | p75    | p100 | hist  |
|---------------------------------|--------------|-----------|---------------|---------|--------|------|--------|------|--------|------|-------|
| lead_time                       | no           | 0         | 1             | 79.32   | 88.98  | 0    | 11.0   | 49   | 117.0  | 518  | ▇▂▁▁▁ |
| lead_time                       | yes          | 0         | 1             | 150.46  | 124.71 | 0    | 49.0   | 115  | 229.0  | 629  | ▇▃▂▁▁ |
| arrival_date_year               | no           | 0         | 1             | 2016.17 | 0.69   | 2015 | 2016.0 | 2016 | 2017.0 | 2017 | ▃▁▇▁▆ |
| arrival_date_year               | yes          | 0         | 1             | 2016.17 | 0.71   | 2015 | 2016.0 | 2016 | 2017.0 | 2017 | ▃▁▇▁▆ |
| arrival_date_week_number        | no           | 0         | 1             | 27.25   | 13.65  | 1    | 16.0   | 28   | 38.0   | 53   | ▅▇▇▇▅ |
| arrival_date_week_number        | yes          | 0         | 1             | 27.22   | 13.19  | 1    | 17.0   | 27   | 38.0   | 53   | ▅▇▇▇▅ |
| arrival_date_day_of_month       | no           | 0         | 1             | 15.80   | 8.73   | 1    | 8.0    | 16   | 23.0   | 31   | ▇▇▇▇▆ |
| arrival_date_day_of_month       | yes          | 0         | 1             | 15.90   | 8.70   | 1    | 8.0    | 16   | 23.0   | 31   | ▇▇▇▇▆ |
| stays_in_weekend_nights         | no           | 0         | 1             | 0.80    | 0.87   | 0    | 0.0    | 1    | 2.0    | 16   | ▇▁▁▁▁ |
| stays_in_weekend_nights         | yes          | 0         | 1             | 0.78    | 0.92   | 0    | 0.0    | 1    | 1.0    | 10   | ▇▁▁▁▁ |
| stays_in_week_nights            | no           | 0         | 1             | 2.13    | 1.42   | 0    | 1.0    | 2    | 3.0    | 41   | ▇▁▁▁▁ |
| stays_in_week_nights            | yes          | 0         | 1             | 2.26    | 1.52   | 0    | 1.0    | 2    | 3.0    | 21   | ▇▁▁▁▁ |
| adults                          | no           | 0         | 1             | 1.83    | 0.53   | 0    | 2.0    | 2    | 2.0    | 4    | ▁▂▇▁▁ |
| adults                          | yes          | 0         | 1             | 1.88    | 0.47   | 0    | 2.0    | 2    | 2.0    | 4    | ▁▂▇▁▁ |
| children                        | no           | 0         | 1             | 0.09    | 0.37   | 0    | 0.0    | 0    | 0.0    | 3    | ▇▁▁▁▁ |
| children                        | yes          | 0         | 1             | 0.08    | 0.35   | 0    | 0.0    | 0    | 0.0    | 3    | ▇▁▁▁▁ |
| babies                          | no           | 0         | 1             | 0.01    | 0.09   | 0    | 0.0    | 0    | 0.0    | 2    | ▇▁▁▁▁ |
| babies                          | yes          | 0         | 1             | 0.00    | 0.05   | 0    | 0.0    | 0    | 0.0    | 1    | ▇▁▁▁▁ |
| is_repeated_guest               | no           | 0         | 1             | 0.04    | 0.18   | 0    | 0.0    | 0    | 0.0    | 1    | ▇▁▁▁▁ |
| is_repeated_guest               | yes          | 0         | 1             | 0.01    | 0.12   | 0    | 0.0    | 0    | 0.0    | 1    | ▇▁▁▁▁ |
| previous_cancellations          | no           | 0         | 1             | 0.02    | 0.33   | 0    | 0.0    | 0    | 0.0    | 13   | ▇▁▁▁▁ |
| previous_cancellations          | yes          | 0         | 1             | 0.16    | 0.49   | 0    | 0.0    | 0    | 0.0    | 13   | ▇▁▁▁▁ |
| previous_bookings_not_cancelled | no           | 0         | 1             | 0.20    | 2.24   | 0    | 0.0    | 0    | 0.0    | 70   | ▇▁▁▁▁ |
| previous_bookings_not_cancelled | yes          | 0         | 1             | 0.03    | 0.75   | 0    | 0.0    | 0    | 0.0    | 48   | ▇▁▁▁▁ |
| booking_changes                 | no           | 0         | 1             | 0.25    | 0.70   | 0    | 0.0    | 0    | 0.0    | 18   | ▇▁▁▁▁ |
| booking_changes                 | yes          | 0         | 1             | 0.08    | 0.40   | 0    | 0.0    | 0    | 0.0    | 6    | ▇▁▁▁▁ |
| days_in_waiting_list            | no           | 0         | 1             | 2.09    | 17.15  | 0    | 0.0    | 0    | 0.0    | 379  | ▇▁▁▁▁ |
| days_in_waiting_list            | yes          | 0         | 1             | 5.10    | 25.85  | 0    | 0.0    | 0    | 0.0    | 391  | ▇▁▁▁▁ |
| adr                             | no           | 0         | 1             | 105.85  | 40.91  | 0    | 80.0   | 100  | 126.0  | 510  | ▇▇▁▁▁ |
| adr                             | yes          | 0         | 1             | 104.84  | 65.02  | 0    | 75.9   | 99   | 125.8  | 5400 | ▇▁▁▁▁ |
| required_car_parking_spaces     | no           | 0         | 1             | 0.04    | 0.20   | 0    | 0.0    | 0    | 0.0    | 3    | ▇▁▁▁▁ |
| required_car_parking_spaces     | yes          | 0         | 1             | 0.00    | 0.00   | 0    | 0.0    | 0    | 0.0    | 0    | ▁▁▇▁▁ |
| total_of_special_requests       | no           | 0         | 1             | 0.74    | 0.84   | 0    | 0.0    | 1    | 1.0    | 5    | ▇▂▁▁▁ |
| total_of_special_requests       | yes          | 0         | 1             | 0.27    | 0.59   | 0    | 0.0    | 0    | 0.0    | 5    | ▇▁▁▁▁ |

Categorical variables

| country | n   |
|---------|-----|
| AGO     | 109 |
| AIA     | 1   |
| ALB     | 3   |
| AND     | 1   |
| ARE     | 14  |
| ARG     | 50  |
| ARM     | 3   |
| ATF     | 1   |
| AUS     | 91  |
| AUT     | 315 |

| arrival_date_month | n    |
|--------------------|------|
| April              | 2208 |
| August             | 2684 |
| December           | 1308 |
| February           | 1494 |
| January            | 1122 |
| July               | 2416 |
| June               | 2334 |
| March              | 1975 |
| May                | 2462 |
| November           | 1305 |
| October            | 2284 |
| September          | 2207 |

| meal | n     |
|------|-------|
| BB   | 18714 |
| None | 3089  |
| HB   | 1983  |
| FB   | 13    |

| market_segment | n     |
|----------------|-------|
| Groups         | 4217  |
| Offline TA/TO  | 5137  |
| Online TA      | 11526 |
| Other          | 2919  |

| deposit_type | n     |
|--------------|-------|
| No Deposit   | 19886 |
| Non Refund   | 3909  |
| Refundable   | 4     |

| customer_type   | n     |
|-----------------|-------|
| Transient       | 17770 |
| Transient-Party | 5242  |
| Contract        | 701   |
| Group           | 86    |

---
title: "The Seaside Hotel"
author: "Tien Nguyen 480494"
date: "May 16, 2021"
output:
  html_document:
    df_print: paged
  html_notebook:
    number_sections: yes
  pdf_document: default
subtitle: Predictive modelling for hotel booking cancellations
editor_options:
  chunk_output_type: inline
  markdown:
    wrap: sentence
---

```{r include=FALSE}
library("tidyverse")
library("dplyr")
library("knitr")
library("skimr")
library("corrplot")
library("caret")
library("recipes")
library("leaps")
library("glmnet")
library("rpart")
library("rpart.plot")
library("randomForest")
library("gbm")
library("xgboost")
library("kableExtra")
```

# Introduction

The Seaside Hotel has asked for a machine learning model that can predict booking cancellations based on the available data. This report aims to identify the best model by comparing different models and their performance. A final model will be recommended, which the hotel can implement and deploy to optimize the planning of staff, inventory and marketing efforts.

# Data understanding

```{r include=FALSE}
load("bookings_train.RData")
load("bookings_test.RData")
load("bookings_test_solutions.RData")
```

The Seaside Hotel has provided a train data set on which several models can be trained and evaluated. The target variable that should be predicted is `is_cancelled`. This target variable contains two classes: `yes` and `no`. The data shows that there is some imbalance in these classes, around 42% and 58% respectively. Therefore, the AUC metric is used instead of Accuracy to select models during model training. The class imbalance is also important to take into consideration when making the class predictions.

```{r include=FALSE}
bookings_train %>% 
  count(is_cancelled) %>% 
  mutate(prop = n / nrow(bookings_train))
```

The entire data set is split into two smaller sets, namely a train set of 60% and a test set of 40%.
Stratified splitting was used for this, meaning that the proportions of the classes of the target variable in the sets are similar to that of the entire data set. There are 23799 observations in the train set and 15864 observations in the test set. For now, the test set is set aside and will only be used after the models have been trained.

```{r include=FALSE}
set.seed(1603)
train_ind <- createDataPartition(bookings_train$is_cancelled, p = 0.6)
bookings_train_small <- bookings_train %>% dplyr::slice(train_ind$Resample1)
bookings_test_small <- bookings_train %>% dplyr::slice(-train_ind$Resample1)
```

When exploring the data, some insights could be derived. The data contains no metadata or variables that could potentially introduce data leakage. In addition, no major multicollinearity issues were found when checking whether any predictors are highly correlated with one another (Appendix 6.7). Thus, no features are removed beforehand.

After looking at the distributions of the classes of the target variable for the numerical variables, one can already get an idea of what could be potentially promising features (Appendix 6.7). For example, the lead time and number of required parking spaces could possibly be good predictors for whether a person will cancel a booking or not, due to the good separation of classes. Despite these initial insights, all features are considered when building the predictive models.

When looking at the categorical variables in the data (Appendix 6.8), mainly `country` has many categories (137 in total). Thus, multiple categories will be consolidated to significantly reduce the number of infrequent categories. Moreover, `meal`, `market_segment`, `deposit_type` and `customer_type` each have a category with a relatively low frequency (a share below 0.01). These categories will also be combined with another category, so as to prevent potential problems when resampling.

# Model selection

To determine what the best predictive model is, different models are trained using a train set, after which their performance is evaluated using the test set.

As mentioned earlier, preprocessing first merges some categories and then dummy encodes all categorical variables. Yeo-Johnson transformations were applied to the numerical variables, since these variables also contain zeros, which a Box-Cox transformation cannot handle. The transformation reduces the skewness of skewed features. The recipe containing these preprocessing steps was prepared on the train set and then baked on the train set as well as on the test set. No normalization was needed for the machine learning models that were trained in this report. The recipe can be found in Appendix 6.1.

## Model training

This section will dive deeper into the trained models. The models were trained using the smaller train set. See Appendix 6.2-6.5 for the code of the models.

Cross-validation was performed besides using a separate test set for model assessment. Where applicable, the one standard error rule is applied to select the value of the tuning parameter. To provide an overview, the following settings were used to train the models:

|                       |                         |
|-----------------------|-------------------------|
| **Resampling method** | Cross-validation        |
| **Number of folds**   | 10                      |
| **Selection method**  | One standard error rule |
| **Metric**            | AUC                     |

```{r include=FALSE}
ctrl <- trainControl(method = "cv", 
                     number = 10, 
                     selectionFunction = "oneSE", 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary)
```
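The one standard error rule can be sketched in a few lines of base R. The cross-validation results below are invented for illustration, and the candidates are assumed to be ordered from the simplest to the most complex model; this is a simplified picture of what `selectionFunction = "oneSE"` does, not a copy of its implementation.

```r
# Toy cross-validation results, ordered from simplest to most complex model
cv <- data.frame(
  complexity = 1:5,
  auc_mean   = c(0.850, 0.880, 0.889, 0.891, 0.890),
  auc_sd     = c(0.010, 0.009, 0.008, 0.009, 0.010)
)
n_folds <- 10

best     <- which.max(cv$auc_mean)             # candidate with the highest mean AUC
se_best  <- cv$auc_sd[best] / sqrt(n_folds)    # standard error of that mean
cutoff   <- cv$auc_mean[best] - se_best
selected <- min(which(cv$auc_mean >= cutoff))  # simplest candidate within one SE
cv[selected, ]
```

Here the fourth candidate has the highest mean AUC, but the third one is selected because it is simpler and lies within one standard error of the best.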

**Backward stepwise regression**

A logistic regression (glm) with backward stepwise feature selection was trained using 10-fold cross-validation. At each step the AIC is used to decide which predictor to drop, so that predictors which add complexity without sufficiently improving the fit are removed. The final model resulting from this training had a cross-validated AUC (reported as ROC in the output) of 0.891. The 10 most important variables according to this model can be seen below.

```{r echo=FALSE}
varImp(step_glm$finalModel) %>% arrange(desc(Overall)) %>% head(10) %>% kable() %>% kable_styling() 
```

This shows that the most important predictor of whether a booking will be cancelled is the number of special requests. The lead time is also a good predictor, as was already seen during data exploration. Moreover, the countries Portugal, Italy and China also seem to be good predictors for `is_cancelled`.

**Lasso regression**

Next, a lasso regression was trained, also using 10-fold cross-validation. This model can also identify irrelevant variables: it shrinks the coefficients of variables with little influence on `is_cancelled` towards 0 and can set them to exactly 0. The optimal model was selected using the one SE rule and had a lambda value of 0.0023357 and a cross-validated AUC of 0.889. The selected value of lambda is marked in the following graph:

```{r echo=FALSE}
ggplot(lasso) +
  geom_vline(xintercept = lasso$bestTune$lambda,
             colour = "red", linetype = 2) +
  scale_x_log10()
```
The estimated coefficients of the final model show that 11 of the 53 variables were set to 0. Among them are months such as March, October and November, possibly because these months fall in the low season of the hotel industry. In terms of variable importance, the deposit type Non-refund is the most important predictor in this model; it was also in the top 5 of the backward stepwise regression. Furthermore, the total number of special requests, Portugal, China, previous cancellations and the online travel agent market segment rank as important in both models. As such, the two models agree on a good number of predictors that are relevant for predicting hotel booking cancellations.

```{r echo=FALSE}
varImp(lasso$finalModel) %>% arrange(desc(Overall)) %>% head(10) %>% kable() %>% kable_styling() 
```

**Decision tree**

A decision tree was built and trained using 10-fold cross-validation. Pruning was performed using the one standard error rule. The selected model had a cp value of 0.003726 and a cross-validated AUC of 0.875. The decision tree looks as follows:

```{r echo=FALSE}
prp(d_tree$finalModel)
```

The variable importance in this model can be seen in the table below. Consistent with the previous models, `deposit_type_Non.Refund`, `total_of_special_requests`, `country_PRT`, `lead_time`, `previous_cancellations`, `market_segment_Online.TA` and `customer_type_Transient.Party` again appear to be important predictors.

```{r echo=FALSE}
varImp(d_tree$finalModel) %>% arrange(desc(Overall)) %>% head(10) %>% kable() %>% kable_styling() 
```

**Boosted classification tree**

Lastly, a boosted classification tree was trained using 10-fold cross-validation. The tuning grid to train the model was as follows:

| Tuning parameters     | Values                  |
|-----------------------|-------------------------|
| Boosting iterations   | 1000, 2500, 5000        |
| Interaction depth     | 1, 2, 3                 |
| Shrinkage             | 0.01, 0.1               |

A summary of the training results can be seen below:

```{r echo=FALSE}
plot(xgb)
```

The final model selected with the one SE rule has 1000 boosting iterations, an interaction depth of 3 and a shrinkage parameter of 0.01. The corresponding cross-validated AUC is 0.9118397. One tree of this model is shown below. The plot shows that the deposit type Non-refund is used to make the first split in this tree.

```{r echo=FALSE}
xgb.plot.tree(model = xgb$finalModel, trees = 1)
```

## Model performance

In this section, the models are assessed on the smaller test set that was set aside earlier. After making the predictions, a confusion matrix was created for each model. The decision threshold was set to 0.42, in line with the roughly 42% of bookings that are cancelled, rather than the default of 0.5. The default threshold results in a higher Accuracy, but biases the predictions towards the majority class, which can lead to erroneous conclusions about the predictions made. The performance metrics for each model are listed below.

```{r include=FALSE}
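# Cross-tabulate predicted against actual classes at a given probability threshold.
# Assumes levels(actual) is ordered c("no", "yes"), so that probabilities above
# the threshold are labelled "yes".
confusion_matrix <- function(actual, predictions, threshold) {
  classes <- cut(predictions,
                 breaks = c(-Inf, threshold, Inf),
                 labels = levels(actual))
  # confusionMatrix() expects predictions in rows and actual classes in columns
  table(Prediction = classes, Actual = actual)
}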
```
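The effect of the decision threshold can be illustrated with a small base R sketch. The labels and probabilities below are invented, and `classify`, `sensitivity` and `specificity` are hypothetical helpers written only for this example. The tables are laid out with predictions in rows and actual classes in columns.

```r
# Toy example: 4 cancelled and 6 non-cancelled bookings with predicted probabilities
actual <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
                 levels = c("no", "yes"))
prob_yes <- c(0.90, 0.70, 0.45, 0.30, 0.48, 0.35, 0.20, 0.15, 0.10, 0.05)

# Label a booking "yes" when its probability exceeds the threshold
classify <- function(prob, threshold) {
  factor(ifelse(prob > threshold, "yes", "no"), levels = c("no", "yes"))
}

cm_042 <- table(Prediction = classify(prob_yes, 0.42), Actual = actual)
cm_050 <- table(Prediction = classify(prob_yes, 0.50), Actual = actual)

# Share of actual cancellations found, and share of non-cancellations kept
sensitivity <- function(cm) cm["yes", "yes"] / sum(cm[, "yes"])
specificity <- function(cm) cm["no", "no"] / sum(cm[, "no"])

c(sensitivity(cm_042), specificity(cm_042))
c(sensitivity(cm_050), specificity(cm_050))
```

In this toy data, lowering the threshold from 0.50 to 0.42 finds one more cancellation at the cost of one extra false alarm, which is the trade-off the threshold controls.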
                          
```{r include=FALSE}
step_preds <- predict(step_glm, newdata = bookings_test_baked, type = "prob")

step_cm <- confusion_matrix(actual = bookings_test_baked$is_cancelled, 
                            predictions = step_preds$yes, threshold = 0.42)

confusionMatrix(step_cm, positive = "yes", mode = "everything")
```

```{r include=FALSE}
lasso_preds <- predict(lasso, newdata = bookings_test_baked, type = "prob")

lasso_cm <- confusion_matrix(actual = bookings_test_baked$is_cancelled, 
                            predictions = lasso_preds$yes, threshold = 0.42)

confusionMatrix(lasso_cm, positive = "yes", mode = "everything")
```

```{r include=FALSE}
d_tree_preds <- predict(d_tree, newdata = bookings_test_baked, type = "prob")

d_tree_cm <- confusion_matrix(actual = bookings_test_baked$is_cancelled, 
                            predictions = d_tree_preds$yes, threshold = 0.42)

confusionMatrix(d_tree_cm, positive = "yes", mode = "everything")
```

```{r include=FALSE}
xgb_preds <- predict(xgb, newdata = bookings_test_baked, type = "prob")

xgb_cm <- confusion_matrix(actual = bookings_test_baked$is_cancelled, 
                            predictions = xgb_preds$yes, threshold = 0.42)

confusionMatrix(xgb_cm, positive = "yes", mode = "everything")
```

**Overview of model performance**

|                           | Backward stepwise regression | Lasso regression | Decision tree | Boosted classification tree |
|---------------------------|--------------------|-----------------|------------------|------------------|
| Accuracy                  | 0.8137             | 0.8105          | 0.8272           | 0.8523           |
| Cohen's kappa             | 0.6076             | 0.6081          | 0.6385           | 0.6971           |
| Sensitivity               | 0.6941             | 0.7531          | 0.7324           | 0.8328           |
| Negative predictive value | 0.8041             | 0.7531          | 0.8237           | 0.8786           |
| Precision                 | 0.8315             | 0.7842          | 0.8333           | 0.8168           |
| Recall                    | 0.6941             | 0.7531          | 0.7324           | 0.8328           |
| F1                        | 0.7566             | 0.7683          | 0.7796           | 0.8247           |

Based on Accuracy, the boosted classification tree performs best, followed by the decision tree. The backward stepwise regression comes close behind, whilst the lasso regression has the lowest Accuracy. The boosted classification tree also scores highest on Cohen's kappa and F1. Because these metrics depend on the chosen threshold, the performance of the models is also visualized to get a better idea of which model really performs best.
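For reference, the sketch below shows how Accuracy, Cohen's kappa and F1 follow from the four cells of a confusion matrix. The counts are made up for illustration and are not taken from the models in this report.

```r
# Hypothetical confusion matrix counts for 100 bookings
tp <- 40  # cancelled, predicted cancelled
fp <- 10  # not cancelled, predicted cancelled
fn <- 8   # cancelled, predicted not cancelled
tn <- 42  # not cancelled, predicted not cancelled
n  <- tp + fp + fn + tn

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected <- ((tp + fp) / n) * ((tp + fn) / n) +
            ((fn + tn) / n) * ((fp + tn) / n)
kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa, f1 = f1), 4)
```

Kappa is lower than Accuracy here because half of the agreement in this toy matrix would already be expected by chance.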

**ROC and AUC**

When plotting the ROC curves for the models, it becomes clearer that the gradient boosting model indeed performs best, as its curve lies above all others. The decision tree has a range in which it performs worse than all other models.

```{r echo=FALSE}
source("calc_confusion.R")
conf_df <- calc_confusion(actual = bookings_test_baked$is_cancelled,
                          positive = "yes", 
                          step_glm = step_preds$yes, lasso = lasso_preds$yes, d_tree = d_tree_preds$yes, xgb = xgb_preds$yes)

conf_df <- conf_df %>% 
  mutate(FPR = FP / (TN + FP), TPR = TP / (TP + FN))

ggplot(conf_df, aes(x = FPR, y = TPR, colour = Model)) +
  geom_path() + 
  geom_abline(colour = "grey", linetype = 2) + 
  coord_equal() 
```
The AUCs of the models are listed below. The boosted classification tree has the highest AUC, and its score is very good. The AUCs of the lasso regression and the backward stepwise regression are very close to one another, whereas the decision tree has the lowest AUC of all. Thus, based on the AUC, the decision tree does not perform as well as the other models, contrary to what the Accuracy indicated earlier.

```{r echo=FALSE}
# Trapezoidal rule per model; grouping keeps lag() from crossing model boundaries
conf_df <- conf_df %>% 
  group_by(Model) %>%
  mutate(trap_width = lag(FPR) - FPR,
         trap_height = 0.5 * (lag(TPR) + TPR),
         trap_area = trap_width * trap_height)

conf_df %>% summarize(AUC = sum(trap_area, na.rm = TRUE)) %>% kable() %>% kable_styling()
```
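The AUC values above are obtained with the trapezoidal rule. A minimal base R sketch on invented ROC points shows the calculation; unlike the chunk above, the points here are ordered by increasing FPR, so `diff()` can be used instead of `lag()`.

```r
# Toy ROC points, ordered from the strictest to the loosest threshold
roc <- data.frame(
  FPR = c(0.0, 0.1, 0.3, 0.6, 1.0),
  TPR = c(0.0, 0.6, 0.8, 0.9, 1.0)
)

width  <- diff(roc$FPR)                                 # base of each trapezoid
height <- (head(roc$TPR, -1) + tail(roc$TPR, -1)) / 2   # mean of its two sides
auc    <- sum(width * height)
auc
```

For these five points the four trapezoids have areas 0.03, 0.14, 0.255 and 0.38, which add up to an AUC of 0.805.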

**Class separation**

The following figures show the distributions of the predicted probabilities per class for each model. From these graphs, the gradient boosting model, the lasso regression and the backward stepwise regression seem to do a good job of separating the classes. The decision tree shows more overlap between the classes and thus may not be able to separate them as well as the others. This is consistent with the earlier observations.

```{r echo=FALSE}
preds <- tibble(Actual = bookings_test_baked$is_cancelled,
                step_glm = step_preds$yes, lasso = lasso_preds$yes, d_tree = d_tree_preds$yes, xgb = xgb_preds$yes)
preds <- preds %>%  pivot_longer(cols = -Actual, names_to = "Model", values_to = "Probability") %>% 
  arrange(Model)

ggplot(preds, aes(x = Probability, fill = Actual)) + 
  geom_density(alpha = 0.5) + 
  facet_wrap(~ Model, scales = "free_y")
```

**Cumulative gain**

The figure below shows the cumulative gain charts for the models. The upper dashed grey line is the best possible scenario, in which the classifier ranks the test cases perfectly. The dashed grey line underneath is the baseline, in which the classifier ranks the test cases randomly. When comparing the lines of the models, the boosted classification tree performs best once again. The lines of the backward stepwise regression and the lasso regression are almost identical.

```{r echo=FALSE}
lift_preds <- tibble(is_cancelled = bookings_test_baked$is_cancelled,
                     step_glm = step_preds$yes,
                     lasso = lasso_preds$yes,
                     d_tree = d_tree_preds$yes,
                     xgb = xgb_preds$yes)
models_lift <- lift(is_cancelled ~ step_glm + lasso + d_tree + xgb, data = lift_preds,
                  class = "yes")
ggplot(models_lift) + coord_equal()
```
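What a cumulative gain chart plots can be reproduced by hand. The base R sketch below uses ten invented bookings; it only mirrors the idea behind the chart and is not the computation that `lift()` performs internally.

```r
# Toy labels (1 = cancelled) and predicted cancellation probabilities
actual   <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
prob_yes <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10)

ord <- order(prob_yes, decreasing = TRUE)  # rank bookings by predicted risk

gain <- data.frame(
  pct_tested = seq_along(ord) / length(ord),       # share of bookings inspected
  pct_found  = cumsum(actual[ord]) / sum(actual)   # share of cancellations found
)
gain[5, ]
```

In this toy ranking, inspecting the top half of the bookings already covers three of the four cancellations, so the fifth row reports a gain of 0.75.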

In this section, the results showed that the boosted classification tree clearly outperforms the other models. Therefore, this model was selected for evaluation in the final part of the report.

# Final model

The owner of the Seaside Hotel has provided a final test set. This section reports the performance of the selected model on this test set. The final model was first trained on the entire train data set, after which it was used to make the predictions. The cross-validated AUC on the entire train data set is now 0.913.

```{r include=FALSE}
tuneGrid_xgb_final <- expand.grid(eta = c(0.01), 
                         max_depth = 3, 
                         colsample_bytree = 0.6, 
                         subsample = 1, 
                         nrounds = c(1000), 
                         gamma = 0,
                         min_child_weight = 1)
```

```{r include=FALSE}
set.seed(1603)
final_xgb <- train(is_cancelled ~ ., 
             data = bookings_final_train_baked, 
             method = "xgbTree", 
             trControl = ctrl,
             tuneGrid = tuneGrid_xgb_final,
             metric = "ROC")
final_xgb
```

```{r include=FALSE}
final_preds <- predict(final_xgb, newdata = bookings_final_test_baked, type = "prob")

final_cm <- confusion_matrix(actual = bookings_final_test_baked$is_cancelled, 
                            predictions = final_preds$yes, threshold = 0.42)

confusionMatrix(final_cm, positive = "yes", mode = "everything")
```

The table below provides an overview of the performance metrics of the gradient boosting model, which made predictions on the final test set. 

|                               |                         |
|-------------------------------|-------------------------|
| **Accuracy**                  | 0.8504                  |
| **Cohen's kappa**             | 0.6931                  |
| **Sensitivity**               | 0.8298                  |
| **Negative predictive value** | 0.8766                  |
| **Precision**                 | 0.8150                  |
| **Recall**                    | 0.8298                  |
| **F1**                        | 0.8223                  |

The final Accuracy is 0.8504, which indicates that the model predicts well on the final test set. Unlike Accuracy, the F1 score balances precision and recall, so a high value indicates that both false positives and false negatives are low. It is therefore a measure of how well the model identifies actual cancellations, and with a value of 0.822 it is considered good. The remaining metrics in the table also reflect a good predictive ability of the model. Taking the AUC into consideration as well, the gradient boosting model again shows an excellent score:

```{r echo=FALSE}
final_conf_df <- calc_confusion(actual = bookings_final_test_baked$is_cancelled,
                          positive = "yes", 
                          final_xgb = final_preds$yes)

final_conf_df <- final_conf_df %>% 
  mutate(FPR = FP / (TN + FP), TPR = TP / (TP + FN))

final_conf_df <- final_conf_df %>% 
  mutate(trap_width = lag(FPR) - FPR,
         trap_height = 0.5 * (lag(TPR) + TPR),
         trap_area = trap_width * trap_height)

final_conf_df %>% summarize(AUC = sum(trap_area, na.rm = TRUE)) %>% kable() %>% kable_styling()
```
As seen in the figure below, there is a clear distinction between the classes. It can therefore be concluded that this model is able to separate the classes well. 

```{r echo=FALSE}
final_preds2 <- tibble(Actual = bookings_final_test_baked$is_cancelled,
                final_xgb = final_preds$yes)
final_preds2 <- final_preds2 %>%  pivot_longer(cols = -Actual, names_to = "Model", values_to = "Probability") %>% 
  arrange(Model)

ggplot(final_preds2, aes(x = Probability, fill = Actual)) + 
  geom_density(alpha = 0.5) + 
  facet_wrap(~ Model, scales = "free_y")
```

Below is the cumulative gain chart for the gradient boosting model on the final test set. This graph shows that the model can identify about 60% of the cancellations with just 25% of the test set.

```{r echo=FALSE}
lift_final_preds <- tibble(is_cancelled = bookings_final_test_baked$is_cancelled,
                           Probabilities = final_preds$yes)
lift_final <- lift(is_cancelled ~ Probabilities, data = lift_final_preds,
                  class = "yes")
ggplot(lift_final) + coord_equal()
```

# Conclusion

This report looked at subset selection, regularization, decision tree and ensemble methods to build a predictive model for the Seaside Hotel, as a means to forecast booking cancellations.

After training several different models on the train set and comparing their performance, it can be concluded that the gradient boosting model performed best of all. Therefore, this model was selected to predict the final test set. For this model, Yeo-Johnson transformations were applied to the predictors, and the total number of categories was reduced to a great extent. 10-fold cross-validation was used on the train set to find the optimal tuning parameters, and the one standard error rule was applied to find a simpler model that is within one standard error of the best one. Eventually, the final gradient boosting model had 1000 iterations, an interaction depth of 3 and a shrinkage value of 0.01. The results indicated a good predictive performance of the model.

Ensembles such as boosting models are usually considered highly accurate compared to, for example, single decision trees, which are prone to overfitting. This can explain the good performance of the gradient boosting model. The cost of this model is its limited interpretability, and in that respect the other three models assessed in this report are better. Good interpretability provides a better understanding of the features that may or may not be good predictors of `is_cancelled`. Because of this, these three models were able to give more insight into the potential drivers of hotel cancellations. One of them is the lead time between the booking date and the arrival date. Whether a booking is guaranteed with a full deposit (Non-refund) also gives an early indication of whether it will be cancelled. Furthermore, bookings made through an online travel agent also seemed to be a good predictor of booking cancellations. The same goes for previous cancellations and the number of special requests. These insights can help the hotel in its marketing efforts, for example by providing offers to guests who are likely to cancel.

When it comes to staff and inventory planning, however, the boosted classification tree will provide more useful and accurate predictions. Although the final gradient boosting model performs well, there are some points of improvement for future iterations. The selected values for the tuning parameters were on the edge of the grid, which suggests that the optimum may lie outside it. For example, an interaction depth of more than 3 could potentially lead to a better performance. As such, it is recommended to use more computing power to explore a wider tuning grid and more intensive models.
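A wider grid for a future iteration could look like the sketch below. The specific values are illustrative assumptions, not tested recommendations; they only extend the grid past the edges on which the selected values fell (fewer iterations, deeper trees, smaller shrinkage). The name `tune_grid_wider` is hypothetical.

```r
# Hypothetical wider tuning grid that extends beyond the selected edge values
tune_grid_wider <- expand.grid(
  eta              = c(0.005, 0.01, 0.05),  # shrinkage below and above 0.01
  max_depth        = 3:6,                   # interaction depths beyond 3
  colsample_bytree = 0.6,
  subsample        = 1,
  nrounds          = c(500, 1000, 2500),    # fewer iterations than 1000 as well
  gamma            = 0,
  min_child_weight = 1
)
nrow(tune_grid_wider)
```

This grid contains 36 combinations, twice as many as the 18 in the original grid, so training time under 10-fold cross-validation would grow accordingly.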

# Appendix

## Preprocessing recipe

```{r}
bookings_recipe <- recipe(is_cancelled ~ ., data = bookings_train_small) %>%
  step_other(country, threshold = 0.01) %>%
  step_other(meal, market_segment, deposit_type, customer_type, threshold = 0.01) %>%
  step_dummy(all_nominal(), -is_cancelled) %>%
  step_YeoJohnson(all_numeric())

bookings_recipe_prep <- bookings_recipe %>% prep(data = bookings_train_small)

bookings_train_baked <- bookings_recipe_prep %>% bake(new_data = bookings_train_small)
bookings_test_baked <- bookings_recipe_prep %>% bake(new_data = bookings_test_small)
bookings_final_train_baked <- bookings_recipe_prep %>% bake(new_data = bookings_train)
bookings_final_test_baked <- bookings_recipe_prep %>% bake(new_data = bookings_test_solutions)
```

## Model 1: Backward stepwise selection

```{r}
set.seed(1603)
step_glm <- train(is_cancelled ~ ., 
                  data = bookings_train_baked, 
                  method = "glmStepAIC", 
                  family = binomial(),
                  trControl = ctrl,
                  metric = "ROC")
step_glm
```

## Model 2: Lasso regression

```{r}
tuneGrid_lasso <- expand.grid(alpha = 1, lambda = 10^(seq(from = -4, to = -2, length.out = 20)))
```

```{r}
set.seed(1603)
lasso <- train(is_cancelled ~ ., 
               data = bookings_train_baked, 
               method = "glmnet", 
               family = "binomial", 
               trControl = ctrl, 
               tuneGrid = tuneGrid_lasso,
               metric = "ROC")
lasso
```

```{r}
# Coefficients at the selected lambda (0.002335721), taken from the tuning result
coef(lasso$finalModel, s = lasso$bestTune$lambda)
```

## Model 3: Decision tree

```{r}
set.seed(1603)
d_tree <- train(is_cancelled ~ ., 
              data = bookings_train_baked, 
              method = "rpart",
              tuneLength = 10,
              trControl = ctrl,
              metric = "ROC")
d_tree
```

## Model 4: Boosted classification tree

```{r}
tuneGrid_xgb <- expand.grid(eta = c(0.01, 0.1), 
                         max_depth = 1:3, 
                         colsample_bytree = 0.6, 
                         subsample = 1, 
                         nrounds = c(1000, 2500, 5000), 
                         gamma = 0,
                         min_child_weight = 1)
```

```{r}
set.seed(1603)
xgb <- train(is_cancelled ~ ., 
             data = bookings_train_baked, 
             method = "xgbTree", 
             trControl = ctrl,
             tuneGrid = tuneGrid_xgb,
             metric = "ROC")
xgb
```

## Data summary

```{r echo=FALSE}
bookings_train_small %>%
  skim() %>%
  knit_print()
```

## Multicollinearity

```{r echo=FALSE, warning=FALSE}
bookings_train_small %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(tl.cex = 0.5)
```

## Class distributions - Numeric variables

```{r echo=FALSE}
bookings_train_small %>%
  group_by(is_cancelled) %>%
  skim() %>%
  knit_print()
```

## Categorical variables

```{r echo=FALSE}
bookings_train_small %>% count(country) %>% head(10) %>% kable() %>% kable_styling()
bookings_train_small %>% count(arrival_date_month) %>% kable() %>% kable_styling()
bookings_train_small %>% count(meal) %>% kable() %>% kable_styling()
bookings_train_small %>% count(market_segment) %>% kable() %>% kable_styling()
bookings_train_small %>% count(deposit_type) %>% kable() %>% kable_styling()
bookings_train_small %>% count(customer_type) %>% kable() %>% kable_styling()
```
